Exercise 1

About the project

This course is an introduction to R and GitHub.

Here is the link to my IODS page
link


Exercise 2 Data analysis

Read the dataset from the local folder and analyze the structure of the dataset

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")  
learning2014 <- read.csv("learning2014.csv", header=TRUE)  
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Deep    : int  43 35 42 42 44 57 46 39 52 48 ...
##  $ Stra    : int  27 22 29 25 29 29 18 32 34 28 ...
##  $ Surf    : int  31 38 27 27 34 29 23 34 26 36 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)  
## [1] 166   8

The “learning2014” dataset consists of 166 observations and 8 variables. The variables (gender, Age, Attitude, Deep, Stra, Surf and Points) describe the results of a survey study conducted as an international collaboration project with teachers.

  • gender = “F” corresponds to female and “M” to male students (factor levels 1 and 2)
  • Age = the age of the students
  • Attitude = Global attitude toward statistics, computed as a sum of numerous variables
  • Deep = Deep approach, computed as a sum of numerous variables
  • Stra = Strategic approach, computed as a sum of numerous variables
  • Surf = Surface approach, computed as a sum of numerous variables
  • Points = exam points of the students

Show a graphical overview of the data and show summaries of the variables in the data

summary(learning2014)
##        X          gender       Age           Attitude          Deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   :19.00  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:40.00  
##  Median : 83.50           Median :22.00   Median :32.00   Median :44.00  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :44.16  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:49.00  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :59.00  
##       Stra            Surf           Points     
##  Min.   :10.00   Min.   :19.00   Min.   : 7.00  
##  1st Qu.:21.00   1st Qu.:29.00   1st Qu.:19.00  
##  Median :25.50   Median :34.00   Median :23.00  
##  Mean   :24.97   Mean   :33.45   Mean   :22.72  
##  3rd Qu.:29.00   3rd Qu.:38.00   3rd Qu.:27.75  
##  Max.   :40.00   Max.   :52.00   Max.   :33.00

There are 110 female and 56 male students. The median age is 22 years (range 17–55). Students scored a median of 44 points [19–59] in the deep approach (Deep), 25.50 [10–40] in the strategic approach (Stra) and 34 points [19–52] in the surface approach (Surf). Their median exam points (Points) were 23 [7–33].

pairs(learning2014[-1], col=learning2014$gender)

The data are visualised by gender (females as black dots and males as red dots, following the factor coding) in a matrix of bivariate scatter plots.

The data can also be visualised using the “GGally” and “ggplot2” packages:

library("GGally")  
## Warning: package 'GGally' was built under R version 3.3.2
library("ggplot2")  
## Warning: package 'ggplot2' was built under R version 3.3.2
p <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))  
p

Now we can see that male students score higher on the attitude questions and female students on the strategic questions. We can also see that Points and Attitude have the highest correlation (cor: 0.43), and Surf and Deep the second highest in absolute value (cor: −0.32).

Regression model

The exam points (Points) students achieve are modelled with a multiple linear regression using Attitude, Stra and Surf as explanatory variables. The model describes the linear relationship between the explanatory variables and the dependent variable.

points_regression <- lm(Points ~ Attitude + Stra + Surf, data = learning2014)  
summary(points_regression)  
## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## Stra         0.10664    0.06770   1.575  0.11716    
## Surf        -0.04884    0.06678  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In the first model with all three variables, Attitude explains Points with a coefficient estimate of 0.34 (p<0.0001). The other variables, Stra and Surf, do not explain Points with statistical significance. Thus, Surf (the least significant) is removed first and the model is fitted again.

points_regression <- lm(Points ~ Attitude + Stra, data = learning2014)  
summary(points_regression)    
## 
## Call:
## lm(formula = Points ~ Attitude + Stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## Stra         0.11421    0.06681   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Attitude remains the only statistically significant variable, so Stra is also removed and the model is fitted with Attitude alone (coefficient 0.35, p<0.0001).

points_regression <- lm(Points ~ Attitude, data = learning2014) 
summary(points_regression)      
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The adjusted R-squared is 0.19, meaning that in this model Attitude explains about 19% of the total variance of the dependent variable Points.
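
As a sanity check, the adjusted R-squared can be recomputed by hand from the values printed above (n = 166 observations, p = 1 predictor):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n  <- 166      # observations
p  <- 1        # predictors (Attitude only)
r2 <- 0.1906   # multiple R-squared from summary()
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(r2_adj, 3)  # ~0.186, matching the 0.1856 reported by summary()
                  # (tiny difference comes from rounding r2)
```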

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

plot(points_regression, which = c(1,2,5))

The residuals of a linear regression model describe the validity of the model: they should be normally distributed, uncorrelated with each other, have constant variance, and their size should not depend on the fitted values.
The Q-Q plot describes the distribution of the residuals. In our model the residuals are approximately normally distributed, as the points fall on a straight increasing line.
The Residuals vs Fitted plot shows how the residuals depend on the fitted values. According to the plot, the residuals are reasonably evenly distributed on both sides of the zero line and, thus, do not depend on the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact on the model. According to the plot, all leverage values fall within [0, 0.05] and thus no single observation has an unusual impact on the model.


Exercise 3 Data analysis

Read the dataset and analyze the structure of the dataset

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv", header = TRUE)
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
dim(alc)
## [1] 382  36

The data consist of 382 observations and 36 variables, whose names are listed above. The data relate students’ alcohol consumption to their social background, gender and study attributes.

Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
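
The derived columns alc_use and high_use can be sketched in base R; the toy data frame below is illustrative only, not the actual wrangling script:

```r
# sketch: derive alc_use and high_use from Dalc and Walc
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of workday and weekend use
toy$high_use <- toy$alc_use > 2            # TRUE above the threshold of 2
toy$alc_use   # 1.5 3.5 5.0
toy$high_use  # FALSE TRUE TRUE
```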

Personal hypothesis of the data

A high number of ‘absences’, lack of ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.

Numerically and graphically explore the distributions of your variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
g1 <- ggplot(alc, aes(x = alc_use, y = absences, fill = sex))
g1 + geom_bar(stat="identity") + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

g2 <- ggplot(alc, aes(x = alc_use, y = famsup == "yes", fill = sex))
g2 + geom_bar(stat="identity") + ylab("family educational support") + ggtitle("Student family educational support by alcohol consumption")

g3 <- ggplot(data = alc, aes(x = alc_use, y = studytime, fill=sex))
g3 + geom_bar(stat="identity") + ylab("weekly study time") + ggtitle("Student weekly study time by alcohol consumption and sex")

g4 <- ggplot(alc, aes(x = alc_use, y = famrel, fill = sex))
g4 + geom_bar(stat="identity") + ylab("quality of family relationships") + ggtitle("Student quality of family relationships by alcohol consumption and sex")

alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 × 3
##   famsup count mean_alc_use
##   <fctr> <int>        <dbl>
## 1     no   144     1.965278
## 2    yes   238     1.842437
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 9 × 3
##   alc_use count mean_absences
##     <dbl> <int>         <dbl>
## 1     1.0   140      3.357143
## 2     1.5    69      4.231884
## 3     2.0    59      3.915254
## 4     2.5    44      6.431818
## 5     3.0    32      6.093750
## 6     3.5    17      5.647059
## 7     4.0     9      6.000000
## 8     4.5     3     12.000000
## 9     5.0     9      6.888889
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 9 × 3
##   alc_use count mean_studytime
##     <dbl> <int>          <dbl>
## 1     1.0   140       2.307143
## 2     1.5    69       1.971014
## 3     2.0    59       1.983051
## 4     2.5    44       1.931818
## 5     3.0    32       1.718750
## 6     3.5    17       1.470588
## 7     4.0     9       1.777778
## 8     4.5     3       2.000000
## 9     5.0     9       1.666667
alc %>% group_by(famrel) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
##   famrel count mean_alc_use
##    <int> <int>        <dbl>
## 1      1     8     2.250000
## 2      2    19     2.184211
## 3      3    64     2.000000
## 4      4   189     1.880952
## 5      5   102     1.750000

Lack of family educational support and bad quality of family relationships seem to correlate with higher alcohol consumption. High alcohol consumption also correlates with more absences from school, and students with high alcohol consumption (alc_use > 3.0) study less than students with low alcohol consumption (alc_use < 2.5). These findings are concordant with the initial hypotheses.

Logistic regression

alc$famrel <- factor(alc$famrel)
alc$studytime <- factor(alc$studytime)
m <- glm(high_use ~ absences + famsup + alc$studytime + alc$famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + famsup + alc$studytime + 
##     alc$famrel, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1803  -0.8226  -0.6503   1.1522   2.2409  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.015455   0.878444  -1.156 0.247693    
## absences        0.076934   0.022630   3.400 0.000675 ***
## famsupyes      -0.061188   0.245702  -0.249 0.803337    
## alc$studytime2 -0.439698   0.267726  -1.642 0.100520    
## alc$studytime3 -1.342617   0.442054  -3.037 0.002388 ** 
## alc$studytime4 -1.279644   0.588173  -2.176 0.029583 *  
## alc$famrel2     0.936030   0.964897   0.970 0.332005    
## alc$famrel3     0.643055   0.883261   0.728 0.466585    
## alc$famrel4     0.272456   0.858654   0.317 0.751012    
## alc$famrel5    -0.006842   0.876748  -0.008 0.993773    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 427.24  on 372  degrees of freedom
## AIC: 447.24
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                       OR      2.5 %     97.5 %
## (Intercept)    0.3622376 0.04877998  1.8120704
## absences       1.0799706 1.03533984  1.1316272
## famsupyes      0.9406468 0.58232256  1.5284716
## alc$studytime2 0.6442312 0.38110876  1.0908572
## alc$studytime3 0.2611613 0.10353967  0.5966021
## alc$studytime4 0.2781364 0.07612213  0.8066302
## alc$famrel2    2.5498396 0.42150543 21.4386194
## alc$famrel3    1.9022832 0.37564722 14.2026417
## alc$famrel4    1.3131852 0.27385709  9.4789868
## alc$famrel5    0.9931812 0.19882276  7.3471751

According to the logistic regression model, more absences correlate with high alcohol consumption (OR 1.08 per absence), whereas studying more than 5 hours a week correlates with lower consumption. Lack of family support and quality of family relationships do not correlate significantly with alcohol consumption in this model.
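
As a check, an odds ratio is simply the exponentiated log-odds coefficient; using the absences estimate from the model summary above:

```r
# an odds ratio is the exponentiated log-odds coefficient
coef_absences <- 0.076934     # estimate from the model summary
or_absences   <- exp(coef_absences)
round(or_absences, 4)         # ~1.08: each extra absence multiplies the
                              # odds of high alcohol use by about 1.08
```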

Explore the predictive power

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     94   20
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.66230366 0.03926702
##    TRUE  0.24607330 0.05235602
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.24607330 0.05235602 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403

According to the results, about 29% of the predictions were wrong. This is clearly better than random guessing (50% wrong), although always predicting FALSE would be wrong for only about 30% of the students (the share of high users), so the model improves on the majority-class baseline only modestly.
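
Both the 29% training error and the majority-class baseline can be read straight off the proportion table above:

```r
# training error = sum of off-diagonal cells of the proportion table
p_false_true <- 0.03926702   # high_use = FALSE, predicted TRUE
p_true_false <- 0.24607330   # high_use = TRUE,  predicted FALSE
training_error <- p_false_true + p_true_false
round(training_error, 3)     # ~0.285, matching loss_func()

p_high <- 0.29842932         # overall share of high users (row sum)
round(p_high, 3)             # ~0.298: error of always predicting FALSE
```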

Bonus - Perform 10-fold cross-validation on your model

library(boot)
#cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#cv
#cv$delta[1]

The cross-validated (test set) performance is slightly worse, with about 31% of predictions wrong.


Exercise 4 Data analysis

Load the data

#setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Explore the dataset

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data frame has 506 rows and 14 columns. The data frame contains the following variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.

Visualize the correlation matrix

pairs(Boston)

As the image is not very readable, let’s improve it with the corrplot package.

library(dplyr)
cor_matrix<-cor(Boston) 
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
#install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix %>% round(2), method="circle")

corrplot(cor_matrix %>% round(2), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

There seems to be a strong negative correlation between age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres), between nox (nitrogen oxides concentration) and dis, between indus (proportion of non-retail business acres per town) and dis, and between lstat (lower status of the population) and medv (median value of owner-occupied homes in $1000s). On the other hand, there are strong positive correlations between indus and nox, indus and tax (full-value property-tax rate per $10,000), nox and age, nox and tax, rm (average number of rooms per dwelling) and medv, and rad (index of accessibility to radial highways) and tax.

Judging by the differences between medians and means, and by their values relative to the minima and maxima, the variables crim, zn, age, rad, tax, indus and black appear skewed (non-normally distributed), while the others may be roughly normal. However, to draw such conclusions, one should first run a test for normality (for instance Shapiro–Wilk or Kolmogorov–Smirnov).
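
As a sketch, such a normality test can be run directly in R; Shapiro–Wilk is used here as a common alternative to Kolmogorov–Smirnov:

```r
library(MASS)        # provides the Boston dataset
data("Boston")
# Shapiro-Wilk test of normality for the heavily skewed crim variable
sw <- shapiro.test(Boston$crim)
sw$p.value < 0.05    # TRUE: normality is clearly rejected for crim
```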

Standardize the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling subtracts the column mean from each value and divides by the column’s standard deviation. After scaling, all variables have a mean of 0 (and a standard deviation of 1), which makes their distributions easier to compare: the closer the median is to 0, the more symmetric the distribution. Scaling is a linear transformation, so unlike e.g. a logarithmic transformation it does not change the shape of the distribution.
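
The effect of scale() can be checked on a toy vector: it subtracts the mean and divides by the standard deviation:

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
x      <- c(2, 4, 6, 8)
scaled <- as.vector(scale(x))
manual <- (x - mean(x)) / sd(x)
all.equal(scaled, manual)      # TRUE
c(mean(scaled), sd(scaled))    # ~0 and 1 after scaling
```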

Change the object to data frame and create a categorical variable of the crime rate

class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000

Create a quantile vector of crim

bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Create a categorical variable ‘crime’

crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

#remove original crim from the dataset
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Choose randomly 80% of the rows

n <- nrow(boston_scaled)
# calling set.seed() before sampling would make the split reproducible
ind <- sample(n,  size = n * 0.8)

Create train and test sets

train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear discriminant analysis

# the lda() function is provided by the MASS package, loaded above
# (the CRAN package "lda" is an unrelated topic-modelling package)
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2500000 0.2648515 0.2450495 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9367670 -0.9192932 -0.10997442 -0.8709345  0.47879112
## med_low  -0.1253317 -0.2571739 -0.03844192 -0.5692501 -0.15762951
## med_high -0.3814499  0.1717640  0.09562423  0.3712930  0.04404255
## high     -0.4872402  1.0149946 -0.11325431  1.0622652 -0.40753554
##                 age        dis        rad        tax     ptratio
## low      -0.8365559  0.8354023 -0.6965304 -0.7567854 -0.41041369
## med_low  -0.3657588  0.3475316 -0.5395408 -0.4385409 -0.05583518
## med_high  0.4217848 -0.3554530 -0.4430577 -0.3355635 -0.24494840
## high      0.8032995 -0.8336607  1.6596029  1.5294129  0.80577843
##                black       lstat        medv
## low       0.38455308 -0.77147434  0.54064282
## med_low   0.31406046 -0.11525783 -0.02315994
## med_high  0.05217546  0.05100453  0.12792499
## high     -0.87646727  0.86583729 -0.77280345
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.12048851  0.731923803 -1.0135288
## indus    0.04685408 -0.339015574  0.2724238
## chas    -0.00582327 -0.029055786  0.1114995
## nox      0.27652793 -0.778624791 -1.3301345
## rm       0.04921114  0.007748152 -0.0944418
## age      0.23447836 -0.291399166 -0.3216750
## dis     -0.08298416 -0.332508100  0.1107297
## rad      3.68954852  0.833193212 -0.1716712
## tax     -0.01324991  0.117864760  0.7763480
## ptratio  0.15271757 -0.027330164 -0.3974084
## black   -0.11911434  0.033326600  0.1346439
## lstat    0.11398414 -0.325734336  0.3741871
## medv    -0.02968968 -0.548674036 -0.2386007
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9549 0.0327 0.0124
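
The "Proportion of trace" line reports how much of the between-group variance each discriminant captures; it can be recomputed from the singular values stored in the fitted model. A sketch on the built-in iris data (not the Boston data):

```r
library(MASS)  # lda()
fit <- lda(Species ~ ., data = iris)
prop <- fit$svd^2 / sum(fit$svd^2)  # matches the "Proportion of trace" output
round(prop, 4)
```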

The function for lda biplot arrows

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Target classes as numeric

classes <- as.numeric(train$crime)

Plot the lda results

plot(lda.fit, dimen = 2)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Save the correct classes from test data and remove the crime variable

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19       9        2    0
##   med_low    4      16        5    0
##   med_high   0       4       13    2
##   high       0       0        1   27

According to the results, the model predicts the true values of the categorical crime variable well. Most of the errors involve the med_low group, where the specificity is lowest.
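
As a quick sanity check (values copied from the cross-tabulation above), the overall accuracy can be read off the diagonal:

```r
# Confusion matrix from the table above; rows = correct, columns = predicted
tab <- matrix(c(19,  9,  2,  0,
                 4, 16,  5,  0,
                 0,  4, 13,  2,
                 0,  0,  1, 27), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # about 0.74 of the test observations are classified correctly
```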

Calculate the distances between the observations and run the k-means algorithm on the dataset

#data('Boston')
#boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)  # the factor variable crime is coerced to NA, hence the warning below
## Warning in dist(boston_scaled): NAs introduced by coercion
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5270  4.9080  4.9390  6.2420 13.0000
dist_man <- dist(boston_scaled, method = "manhattan")
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
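
The difference between the two metrics is easy to see on a toy pair of points:

```r
# Toy sketch: Euclidean vs Manhattan distance between (0, 0) and (3, 4)
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))                        # sqrt(3^2 + 4^2) = 5
as.numeric(dist(m, method = "manhattan"))  # |3| + |4| = 7
```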

K-means algorithm

Determine and visualize the optimal number of clusters by trying up to 10 clusters and calculating the total within-cluster sum of squares (WCSS) for each

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

Based on the sharp drop in the total within-cluster sum of squares, the optimal number of clusters is 2.

K-means cluster analysis

km <- kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)

Bonus: Run k-means with 3 centers and perform LDA using the clusters as target classes

km <- kmeans(dist_eu, centers = 3)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4426877 0.2075099 0.3498024 
## 
## Group means:
##           zn      indus        chas        nox         rm        age
## 1 -0.2309344 -0.4551440 -0.27232907 -0.4915087 -0.1311176 -0.2883074
## 2  1.3140079 -0.9068835  0.47759479 -0.7455383  1.0548777 -0.7434166
## 3 -0.4872402  1.1139831  0.06132349  1.0642908 -0.4598408  0.8058735
##          dis        rad        tax      ptratio      black      lstat
## 1  0.2864417 -0.5768315 -0.6147205 -0.006886361  0.3154274 -0.2702405
## 2  0.7952165 -0.5935800 -0.6372993 -0.976736455  0.3398644 -0.8793299
## 3 -0.8342410  1.0821251  1.1560103  0.588134874 -0.6007995  0.8636357
##          medv crimemed_low crimemed_high crimehigh
## 1 -0.02274038   0.43750000     0.2589286 0.0000000
## 2  1.16160305   0.16190476     0.2761905 0.0000000
## 3 -0.66030777   0.06214689     0.2203390 0.7175141
## 
## Coefficients of linear discriminants:
##                        LD1         LD2
## zn             0.172610013 -1.25618179
## indus          1.078970813 -0.12226466
## chas           0.004463932 -0.49433521
## nox            0.921052125 -0.16586799
## rm            -0.040950198 -0.49922411
## age           -0.067015888  0.07618846
## dis            0.138379223 -0.06104503
## rad            0.579745765  0.43033468
## tax            0.401343141 -0.66310399
## ptratio        0.374368029  0.11727097
## black         -0.061203934  0.05315043
## lstat          0.376322744 -0.56677965
## medv           0.109367758 -0.69236403
## crimemed_low  -0.389791316  0.15910228
## crimemed_high -0.486827228 -0.58792893
## crimehigh     -0.207144258 -1.08365331
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7951 0.2049
plot(lda.fit, dimen = 2)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Using 3 clusters for the k-means analysis, the most influential linear separators are nox and chas. This means that if the data were divided into 3 classes, the variables nox and chas would explain most of the dimensional difference and would thus be the best variables for predicting possible new observations.

Super-Bonus: Create a matrix product, which is a projection of the data points

model_predictors <- dplyr::select(train, -crime)
###Check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 16  2
###Matrix multiplication
#matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
#library("plotly")
###Create a 3D plot (Cool!) of the columns of the matrix product
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = km$cluster)

The same results as above are visualised in 3D, using the variable crime in the first plot and the k-means cluster assignment in the second to separate the observations by colour. The visualisation fails after knitting, most likely because lda.fit was refitted in the bonus section with 16 predictors, so its 16 x 2 scaling matrix no longer conforms with the 13 predictor columns of train in the matrix multiplication.